Abstract: Over binary-input channels, the uniform distribution is
a universal prior, in the sense that it maximizes the
worst-case mutual information over all binary-input channels,
ensuring at least 94.2% of the capacity. In this paper, we address
a similar question, but with respect to a universal generalized
linear decoder. We look for the best collection of finitely many
a posteriori metrics that maximizes the worst-case mismatched
mutual information achieved by decoding with these metrics
(instead of an optimal decoder such as the maximum likelihood
(ML) decoder tuned to the true channel). It is shown that for binary
input and output channels, two metrics suffice to achieve
the same performance as an optimal decoder. In particular, this
implies that there exists a decoder which is generalized linear
and achieves at least 94.2% of the compound capacity on any
compound set, without knowledge of the underlying set.

I. Introduction 

We consider the problem where a communication system
is to be designed without explicit knowledge of the channel.
Here, neither the transmitter nor the receiver has access to
the exact channel law. The goal is to devise a single coding
strategy which ensures reliable communication over the
unknown channel picked for transmission. We assume i.i.d.
realizations of the channel at each channel use, i.e., we are
interested in a universal coding framework for communicating
over discrete memoryless channels (DMC). In this paper,
we present results for DMC's with binary input and output
alphabets, which we refer to as binary memoryless channels
(BMC). Our goal is to design decoders which have a linear
structure and which entail reliable communication at the
largest possible rate in this setting. In the next section, we
review in more detail the notions of universality and linearity
for DMC's, and their attributes. We will then formulate our
problem as a game where the decoder has to pick the decoding
metrics, i.e., a generalized linear decoder, before nature selects
a channel for communication.

A. Universality 

If the channel over which communication takes place is
unknown at both the transmitter and the receiver but belongs
to a set of DMC's S, then we are in the setting of compound
channels. Let us denote by X the input alphabet and by Y
the output alphabet of any DMC in S. The objective is to
design a code (i.e., an encoder and decoder pair) which
provides a means for reliable communication, independently of
which W ∈ S is picked (by nature) for transmission. The

R.Pulikkoonattu was with EPFL. He has been with Broadcom since October 
2009. 



compound channel problem has been extensively studied in 
the literature such as 0,(3, 0,03,03,11161 and ifTTl . 

The highest achievable rate, known as the compound capacity
C(S) of a set S of channels, is established in [5] and
is given by

    C(S) = \max_{P} \inf_{W \in S} I(P, W),    (1)

where the maximization is over all possible probability distributions
P on X, and the infimum is over all the
channels W in the compound set S.
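
As a rough numerical illustration of (1), the following sketch approximates the max-inf by a grid search over binary input distributions for a finite compound set. The two BSCs in `S`, the grid size, and all function names are assumptions made for this example only; they do not come from the references.

```python
import math

def mutual_information(p0, W):
    """I(P, W) in bits for P = (p0, 1 - p0) and a 2x2 channel W[x][y] = W(y | x)."""
    P = (p0, 1.0 - p0)
    # Output marginal q(y) = sum_x P(x) W(y | x).
    q = [sum(P[x] * W[x][y] for x in range(2)) for y in range(2)]
    total = 0.0
    for x in range(2):
        for y in range(2):
            joint = P[x] * W[x][y]
            if joint > 0.0:
                total += joint * math.log2(W[x][y] / q[y])
    return total

def compound_capacity(channels, grid=1001):
    """Grid-search approximation of max_P min_{W in S} I(P, W) for a finite set S."""
    best_rate, best_p0 = -1.0, None
    for i in range(grid):
        p0 = i / (grid - 1)
        worst = min(mutual_information(p0, W) for W in channels)
        if worst > best_rate:
            best_rate, best_p0 = worst, p0
    return best_rate, best_p0

def bsc(eps):
    """Binary symmetric channel with crossover probability eps."""
    return [[1.0 - eps, eps], [eps, 1.0 - eps]]

# Hypothetical compound set: two BSCs with crossover 0.1 and 0.2.
S = [bsc(0.1), bsc(0.2)]
rate, p0 = compound_capacity(S)
```

For this particular set the noisier BSC is the worst channel for every input distribution, so the search should settle on the uniform input and on the capacity of the BSC with crossover 0.2.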

In 0, a decoder that maximizes a uniform mixture of 
likelihoods, over a dense set of possible channels is proposed 
as a universal decoder. In the literature, a decoder which allows 
us to achieve the same random coding exponent (without 
the knowledge of true channel) as the maximum likelihood 
(ML) decoder tuned to the true channel is called a universal 
decoder. The maximum mutual information (MMI) decoder
introduced in [13] is a universal decoder. The MMI decoder
computes the empirical mutual information (EMI) between a
given received output and each codeword in the codebook,
and declares the codeword with the highest EMI score as the
sent codeword. A number of other universal decoders have been
proposed in the literature, such as the Lempel-Ziv (LZ)
based algorithm [23] and the merged likelihood decoder [12].
The MMI decoder has another interesting feature: it does
not even require knowledge of the compound set to be
defined. In that sense, the MMI decoder is a "doubly universal"
decoder. However, practical use of the MMI decoder is precluded
by complexity considerations, and the same holds for other existing
universal decoders. Note that in this paper, we are primarily
concerned with the achievable rate rather than the error exponent.

B. Linearity 

A linear (or additive) decoder is defined to have the following
structure. Upon receiving each n-symbol output y,
the decoder computes a score (decoding metric) $d^n(x_m, y)$
for each codeword $x_m$, $m = 1, 2, \ldots, 2^{nR}$, and declares the
codeword with the highest score as the estimate of the sent codeword
(ties are resolved arbitrarily). Moreover, the n-symbol
decoding metric has the following additive structure:

    d^n(x, y) = \sum_{i=1}^{n} d(x(i), y(i)),  \forall x \in \mathcal{X}^n, y \in \mathcal{Y}^n,

where $d : \mathcal{X} \times \mathcal{Y} \to \mathbb{R}$ is a single-letter decoding metric. We
call such decoders linear decoders since the decoding metric
they compute is indeed linear in the joint empirical distribution
between the codeword and the received word:

    d^n(x, y) = n \sum_{u \in \mathcal{X}, v \in \mathcal{Y}} \hat{P}_{(x,y)}(u, v) d(u, v),

where $\hat{P}_{(x,y)}(u, v)$ denotes the joint empirical distribution of
(x, y). We then say that the linear decoder is induced by the
single-letter metric d. A significant body of work on
linear decoders exists in the literature. We refer to [2], [9], and
the references therein for a detailed review of such decoders.
Examples of linear decoders are the maximum likelihood (ML)
and the maximum a posteriori (MAP) decoders, whereas the
MMI decoder is not linear.
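
The equivalence between the additive form and the empirical-distribution form of the metric can be checked on a toy example. The sketch below is only illustrative: the ML metric of a BSC with crossover 0.1 and the two length-8 words are arbitrary choices, and the function names are hypothetical.

```python
import math
from collections import Counter

def additive_score(x, y, d):
    """d^n(x, y) = sum_i d(x(i), y(i))."""
    return sum(d[(xi, yi)] for xi, yi in zip(x, y))

def empirical_score(x, y, d):
    """n * sum_{u,v} P_hat(u, v) d(u, v), with P_hat the joint type of (x, y)."""
    n = len(x)
    counts = Counter(zip(x, y))
    return n * sum((c / n) * d[uv] for uv, c in counts.items())

# Hypothetical single-letter metric: d(u, v) = log W(v | u) for a BSC(0.1).
eps = 0.1
d = {(u, v): math.log(1 - eps if u == v else eps)
     for u in (0, 1) for v in (0, 1)}

x = (0, 1, 1, 0, 1, 0, 0, 1)
y = (0, 1, 0, 0, 1, 1, 0, 1)
s1 = additive_score(x, y, d)   # sum over the n positions
s2 = empirical_score(x, y, d)  # same value, computed from the joint type
```

Both scores depend on (x, y) only through the joint type, which is what makes the decoder compatible with trellis-based or message-passing implementations.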

The main advantage of a linear decoder is that, when used
with appropriately structured codes, it allows a significant reduction
in the decoding complexity. In this regard, a convenient
structure is that of linear encoders, which produce codewords
through a linear transformation

    x = Gu,

where $u \in \mathcal{X}^{nR}$ contains the information bits and $G \in \mathcal{X}^{n \times nR}$
is the generator matrix. With such encoders, linear decoders
allow the use of techniques such as the Viterbi algorithm,
which significantly reduces the complexity of
optimum decoding (e.g., maximum likelihood sequence decoding
(MLSE)) of convolutional codes, or the message-passing
(belief propagation) algorithms adopted for the decoding of
several modern coding schemes ETTl . 

The reduction in decoding complexity discussed so
far is possible only when the code is appropriately structured.
However, for the proof of existence of linear universal decoders
in this paper, we rely on the random coding argument,
where one establishes the existence of a deterministic code
which yields good performance, without explicitly constructing
the code. For the complexity claims, one still needs to
investigate whether appropriately structured encoders can
realize the performance guaranteed by the random coding
argument. However, from an argument of Elias [11], we
already know that this is possible for binary symmetric channels,
where it is sufficient to consider random binary codes
to achieve the performance of arbitrary random codes; this
argument relies on the fact that pairwise independence
between the codewords is a sufficient assumption, and it has
been further generalized in [6].

A class of decoders slightly more general than linear
decoders, called generalized linear decoders in [2], is
defined as follows. Instead of requiring that
the decoding metric be additive, it is only required that the
score function be the maximum of finitely
many additive metrics, cf. Definition 2 in [2]. The purpose
of studying generalized linear decoders is that all properties
mentioned above for linear decoders still hold for generalized
linear decoders.

C. Linear universal decoders 

In view of constructing universal codes of manageable
complexity, a first legitimate question is whether linear
universal decoders can be constructed. Not surprisingly, some
compound channels do not admit a linear universal decoder.
It is known that, when the set S is convex and compact,
the maximum likelihood decoder tuned to the channel offering the
least mutual information (for the optimal input distribution in
(1)) serves as a compound capacity achieving decoder [9]. In
DD , ED, it is shown that this result still holds if the set S is 
non convex but one-sided, cf. Definition 4 in O. Moreover, 
the authors of [2] show that if S is a finite union of one-sided
sets, a generalized linear decoder that achieves the compound
capacity can be constructed.

In this paper, we construct a generalized linear decoder
which achieves at least 94.2% of the compound capacity on any BMC
compound set, by using the same two metrics, chosen irrespective
of the given compound set.

The remainder of this paper is organized as follows. We
review known results for DMC's and introduce the notation
in the next section. The problem statement is discussed
in Section III. We then present the main results for BMC's in
Section IV.

II. Known results for DMC 

We consider discrete memoryless channels with input alphabet
X and output alphabet Y. A DMC is described by
a probability transition matrix W, each row of which is the
conditional probability distribution of the output given an input
symbol. We denote by S a compound set of DMC's. While the
set of channels is known to both the transmitter and the
receiver, the exact channel of communication, denoted by $W_0$,
is unknown to them.

We assume that the transmitter and receiver operate synchronously
over blocks of n symbols. In each block, a message
$m \in \{1, 2, \ldots, 2^{nR}\}$ is mapped by an encoder

    F_n : \{1, 2, \ldots, 2^{nR}\} \to \mathcal{X}^n = \{0, 1\}^n

to $F_n(m) = x_m$, referred to as the m-th codeword. The receiver,
upon observing a word y drawn from the distribution

    W^n(y | x_m) = \prod_{i=1}^{n} W(y(i) | x_m(i)),

applies a decoding map

    G_n : \mathcal{Y}^n \to \{1, 2, \ldots, 2^{nR}\}.

The probability of error, averaged over the messages of a given code
$(F_n, G_n)$ for a specific channel W, is expressed as

    P_e(F_n, G_n, W) = \frac{1}{2^{nR}} \sum_{m=1}^{2^{nR}} \sum_{y : G_n(y) \neq m} W^n(y | x_m).

We say that a rate R is achievable for a given compound set
S if, for any $\epsilon > 0$, there exist a block length n and a code of
rate at least R such that $P_e(F_n, G_n, W) < \epsilon$ for all $W \in S$.
The supremum of such achievable rates is called the compound
channel capacity, denoted by C(S). Blackwell et al. [5]
established the following expression for the compound channel capacity:

    C(S) = \max_{P} \inf_{W \in S} I(P, W).

Before proceeding with the problem statement, we
introduce some notation and a few useful definitions.

Definition 1: (Generalized Linear Decoder)
Let $d_1, d_2, \ldots, d_K$ be K single-letter metrics, where K is a
finite integer. A generalized linear decoder induced by these
metrics is defined by the decoding map

    G_n(y) = \arg\max_{m} \bigvee_{k=1}^{K} \sum_{i=1}^{n} d_k(x_m(i), y(i))
           = \arg\max_{m} \bigvee_{k=1}^{K} E_{\hat{P}_{(x_m, y)}}[d_k],

where $\vee$ denotes the maximum operator and $\hat{P}_{(x_m, y)}$ denotes
the joint empirical distribution of $(x_m, y)$.
An example of a generalized linear decoder is the generalized
likelihood ratio test (GLRT) which, for a given collection of
channels $W_1, \ldots, W_K$, is induced by the metrics

    \log W_k(v | u),  \forall u \in \mathcal{X}, v \in \mathcal{Y}, k = 1, \ldots, K.
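
A minimal sketch of the decoding map of Definition 1 with GLRT-style metrics is given below. The two channels, the three-codeword codebook, and the tie-breaking rule (smallest index) are assumptions chosen purely for illustration.

```python
import math

def generalized_linear_decode(y, codebook, metrics):
    """Return the index m maximizing max_k sum_i d_k(x_m(i), y(i)).
    Ties are resolved in favor of the smallest index."""
    def score(x):
        return max(sum(d[(xi, yi)] for xi, yi in zip(x, y)) for d in metrics)
    return max(range(len(codebook)), key=lambda m: score(codebook[m]))

def log_likelihood_metric(W):
    """Metric d(u, v) = log W(v | u) for a 2x2 channel W[u][v] with positive entries."""
    return {(u, v): math.log(W[u][v]) for u in (0, 1) for v in (0, 1)}

# Hypothetical channels and codebook, used only to exercise the decoder.
W1 = [[0.9, 0.1], [0.2, 0.8]]
W2 = [[0.3, 0.7], [0.6, 0.4]]
metrics = [log_likelihood_metric(W1), log_likelihood_metric(W2)]
codebook = [(0, 0, 1, 1, 0, 1), (1, 0, 1, 0, 0, 1), (0, 1, 0, 1, 1, 0)]

y = (0, 0, 1, 1, 0, 1)  # received word, here equal to codeword 0
m_hat = generalized_linear_decode(y, codebook, metrics)
```

The decoder evaluates K additive scores per codeword and keeps the largest, so its cost is K times that of a single linear decoder.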

We now denote by P an input distribution. For a channel
denoted by $W_k$, we use $\mu_k$ to denote the joint distribution of
an input and output pair through $W_k$, i.e., $\mu_k = P \circ W_k$. We
denote by $\mu_k^p$ the product measure of the marginal input distribution
P and the marginal output distribution
$(\mu_k)_Y(y) = \sum_{x \in \mathcal{X}} \mu_k(x, y)$. Hence,

    I(P, W_k) = D(\mu_k \| \mu_k^p).

Lemma 1: When the true channel is $W_0$ and a generalized
linear decoder induced by the single-letter metrics $\{d_k\}_{k=1}^{K}$ is
used, we can achieve the following rate:

    I_{MIS}(P, W_0, \{d_k\}_{k=1}^{K}) = \bigwedge_{k=1}^{K} \inf_{\mu \in A_k} D(\mu \| \mu_0^p),

where $\wedge$ denotes the minimum operator and

    A_k = \{\mu : \mu^p = \mu_0^p, E_{\mu}[d_k] \geq \bigvee_{j=1}^{K} E_{\mu_0}[d_j]\},  \forall 1 \leq k \leq K.

In particular, if K = 1, this is the mismatched result in [9] and
Ull, and if K ≥ 2, it is a consequence of [9], [19], as discussed
in [2]. Extensive coverage of the mismatched problem [16]
appears in the literature O , El , fl9) , 031 , flT31 , JT^l as well as 
0. 

Definition 2 (one-sided sets): A set S of DMC's is one-sided
with respect to an input distribution P if

    W_S = \arg\min_{W \in cl(S)} I(P, W)

is unique and if

    D(\mu \| \mu_S^p) \geq D(\mu \| \mu_S) + D(\mu_S \| \mu_S^p)

for any $\mu = P \circ W$, where $W \in S$ and $\mu_S = P \circ W_S$. A set
S of DMC's is a union of one-sided sets if $S = \bigcup_{k=1}^{K} S_k$ for
some $K < \infty$ and if the $S_k$'s are one-sided with respect to

    P^* = \arg\max_{P} \inf_{W \in S} I(P, W).
p wes 

Theorem 1: For a compound set $S = \bigcup_{k=1}^{K} S_k$ which is a
union of one-sided sets, the generalized linear decoder induced
by the metrics

    d_k = \log \frac{W_{S_k}}{(\mu_{S_k})_Y},  1 \leq k \leq K,

where $W_{S_k} = \arg\min_{W \in cl(S_k)} I(P^*, W)$ and $\mu_{S_k} = P^* \circ W_{S_k}$,
is compound capacity achieving. Moreover, if the true channel $W_0$ is known to
belong to $S_k$, this decoder achieves the rate

    I_{MIS}(P^*, W_0, \log \frac{W_{S_k}}{(\mu_{S_k})_Y}).    (2)



Notice that this decoder requires full knowledge of
the compound set. In other words, knowing the compound
capacity as well as the compound capacity achieving input
distribution (which, for instance, suffices for MMI) does not
suffice to construct such a decoder. One can interpret this
decoder as follows: the MMI decoder achieves compound
capacity on any compound set by acting as a "degenerate
generalized linear" decoder with infinitely (and uncountably)
many metrics: the a posteriori metrics of all possible DMCs
with the given alphabets. Hence, there is no linear property
(and consequent benefit) for such a decoder. However, what
Theorem 1 says is that, since we have knowledge of
the compound set, we can use it to tune a generalized linear
decoder which still achieves compound capacity by picking
only the important channels and the corresponding a posteriori
metrics. Our goal in this paper is to investigate whether further
simplification of the above generalized linear decoder is
possible when restricted to BMC's. More specifically, we address
the possibility of building a universal decoder tuned to metrics
chosen independently of the given compound BMC set.

Using the symmetry properties occurring in the BMC's, we 
have the following result (cf. Q~)). 

Lemma 2: Let $P_1$, $P_2$ be two stochastic matrices of size
2 × 2 such that $\det(P_1 P_2) > 0$. Let C be a set of binary vectors
of length n with fixed composition. For any $x_1, x_2 \in C$ and
$y \in \{0, 1\}^n$, we respectively have

    P_1(y | x_1) > (=) P_1(y | x_2)  \Rightarrow  P_2(y | x_1) > (=) P_2(y | x_2).

Note that the condition $\det(P_1 P_2) > 0$ simply means that $P_1$
and $P_2$ have their maximal value within a column at the same
place.
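
The ordering property of Lemma 2 can be checked exhaustively on a small fixed-composition code. The sketch below uses exact rational arithmetic so that equalities are not blurred by rounding; the block length, the composition, and the three matrices are arbitrary choices for illustration, and `same_ordering` is a hypothetical helper name.

```python
from fractions import Fraction as F
from itertools import combinations, product

def likelihood(P, x, y):
    """n-fold product P(y | x) = prod_i P[x_i][y_i], computed exactly."""
    result = F(1)
    for xi, yi in zip(x, y):
        result *= P[xi][yi]
    return result

def same_ordering(P1, P2, n=6, weight=3):
    """Check the implication of Lemma 2 over all pairs of codewords and all y."""
    # Fixed-composition code: all length-n binary vectors with `weight` ones.
    code = [tuple(1 if i in ones else 0 for i in range(n))
            for ones in combinations(range(n), weight)]
    for y in product((0, 1), repeat=n):
        l1 = [likelihood(P1, x, y) for x in code]
        l2 = [likelihood(P2, x, y) for x in code]
        for i in range(len(code)):
            for j in range(len(code)):
                if l1[i] > l1[j] and not l2[i] > l2[j]:
                    return False
                if l1[i] == l1[j] and l2[i] != l2[j]:
                    return False
    return True

P1 = [[F(9, 10), F(1, 10)], [F(2, 10), F(8, 10)]]  # det(P1) > 0
P2 = [[F(7, 10), F(3, 10)], [F(4, 10), F(6, 10)]]  # det(P2) > 0
P3 = [[F(2, 10), F(8, 10)], [F(7, 10), F(3, 10)]]  # det(P3) < 0

agree = same_ordering(P1, P2)     # det(P1 P2) > 0
disagree = same_ordering(P1, P3)  # det(P1 P3) < 0
```

With a fixed composition, each likelihood depends on (x, y) only through the number of positions where both are 1, which is why the sign of the determinant alone decides the ordering.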

Hence, we would like to investigate whether, for BMC's, the
generalized linear decoder of Theorem 1 can be simplified
so as to require only a few metrics, chosen independently of the
compound set.

III. Problem statement 

A. The α and β game

From now on, we only consider BMC's.
Let the parameter α be defined as

    \alpha = \max_{P} \inf_{W \in BMC} \frac{I(P, W)}{C(W)},    (3)

and let the distribution P which achieves α be denoted by $P_{opt}$,
i.e.,

    P_{opt} := \arg\max_{P} \inf_{W \in BMC} \frac{I(P, W)}{C(W)}.    (4)

The term α is the ratio of the maximum achievable rate
to the channel capacity, for the worst possible channel in the
compound set, when a single input distribution is chosen.

Let $K \in \mathbb{Z}_+$ and let $I_{MIS}(P_{opt}, W_0, \{d_k\}_{k=1}^{K})$ be the achievable
mismatched rate on a channel $W_0$, using a generalized
linear decoder induced by the K metrics $\{d_k\}_{k=1}^{K}$. The expression
of $I_{MIS}(P_{opt}, W_0, \{d_k\}_{k=1}^{K})$ is given in Lemma 1,
and is proved to be an achievable rate in [9], [19], as
discussed in [2]. Indeed, since we are working with binary
input (and output) channels, this mismatched achievable rate
is equal to the mismatched capacity. We then define another
parameter $\beta_K$ by

    \beta_K := \sup_{\{d_k\}_{k=1}^{K}} \inf_{W_0 \in BMC} \frac{I_{MIS}(P_{opt}, W_0, \{d_k\}_{k=1}^{K})}{I(P_{opt}, W_0)}.    (5)

Clearly $0 \leq \alpha, \beta_K \leq 1$. The problem of finding α has already
been solved in [22], and the answer is α ≈ 0.942, as reviewed
below. From Theorem 1 (second part), one can show that by
taking K large enough, we can make $\beta_K$ arbitrarily close to
1. Indeed, this relates to the fact that we can approximate the
set of DMC's by a covering of one-sided components (for the
uniform input distribution). Hence, one can study the speed of
convergence of $\beta_K$ in K to deduce how many metrics, in order of
magnitude, are needed to achieve a given performance. We
believe this is an interesting problem, as it captures the cost (in
the number of additive metrics) needed to "replace" the MMI
decoder with a generalized linear decoder; and this problem
can be addressed for any alphabet dimensions. However, as
motivated in the previous section with Lemma 2, we hope to
get exactly $\beta_2 = 1$ for the binary alphabets setting, in which
case we do not need to investigate the speed of convergence
problem (this will indeed be the case).

IV. Results 

A. Optimal input distribution 

The optimization problem for α has been investigated in
[22], with the following answer.

Theorem 2 (Shulman and Feder): α ≈ 0.942 and $P_{opt}$ is
the uniform distribution.

The authors also identified the worst channel to be a Z-channel.
This result is also a ramification of the fact that, with a
uniform input distribution, the maximum loss for any channel
is less than 5.8% of capacity, as originally reported by Majani
and Rumsey [18].
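
The behavior described in Theorem 2 can be explored numerically by comparing I(U, W) with a grid-search estimate of C(W) on a few Z-channels. This is a sketch under stated assumptions: the capacity is approximated on a finite grid of input distributions, and the crossover values are arbitrary sample points rather than a search over all BMC's.

```python
import math

def mutual_information(p0, W):
    """I(P, W) in bits for P = (p0, 1 - p0) and a 2x2 channel W[x][y] = W(y | x)."""
    P = (p0, 1.0 - p0)
    q = [sum(P[x] * W[x][y] for x in range(2)) for y in range(2)]
    total = 0.0
    for x in range(2):
        for y in range(2):
            joint = P[x] * W[x][y]
            if joint > 0.0:
                total += joint * math.log2(W[x][y] / q[y])
    return total

def capacity(W, grid=2001):
    """Grid-search approximation of C(W) = max_P I(P, W); the grid contains p0 = 0.5."""
    return max(mutual_information(i / (grid - 1), W) for i in range(grid))

def ratio_uniform_to_capacity(W):
    """I(U, W) / C(W), the quantity minimized over W in the definition of alpha."""
    return mutual_information(0.5, W) / capacity(W)

def z_channel(p):
    """Z-channel: input 0 is received noiselessly, input 1 is flipped with probability p."""
    return [[1.0, 0.0], [p, 1.0 - p]]

ratios = {p: ratio_uniform_to_capacity(z_channel(p)) for p in (0.1, 0.5, 0.9, 0.99)}
```

On these sample points the ratio decreases as the Z-channel gets noisier while staying above the 94.2% level, in line with the worst case being approached by Z-channels.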

B. Optimal Generalized Linear Decoder 

We represent a BMC by a point $(a, b) \in [0, 1]^2$, with the
following mapping to specify the BMC:

    W = \begin{pmatrix} a & 1-a \\ 1-b & b \end{pmatrix},  0 \leq a, b \leq 1.

Definition 3: Let $B^- = \{(a, b) \in [0, 1]^2 | a + b < 1\}$ and
$B^+ = \{(a, b) \in [0, 1]^2 | a + b \geq 1\}$, and let U denote the
uniform binary distribution.

Note that $B^-$ parameterizes the set of BMC's which are
flipping-like, in the sense that, assuming the input and output
alphabets to be {0, 1}, for any given output y of a BMC in
$B^-$, it is more likely that the sent input is 1 + y (mod 2).
Similarly, $B^+$ parameterizes the set of BMC's which are
non-flipping-like, containing in particular the set of channels for
which a + b = 1, which are all the pure noise channels (zero
mutual information).
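
The parameterization and the two classes of Definition 3 can be sketched in a few lines. The helper names and the three sample points are hypothetical; the last point lies on the line a + b = 1 and illustrates the pure noise case.

```python
import math

def bmc(a, b):
    """Transition matrix W[x][y] = W(y | x) of the BMC at the point (a, b)."""
    return [[a, 1.0 - a], [1.0 - b, b]]

def bmc_class(a, b):
    """Return 'B-' when a + b < 1 (flipping-like) and 'B+' otherwise."""
    return "B-" if a + b < 1.0 else "B+"

def uniform_mutual_information(a, b):
    """I(U, W) in bits for the uniform input distribution U."""
    W = bmc(a, b)
    q = [0.5 * (W[0][y] + W[1][y]) for y in range(2)]
    return sum(0.5 * W[x][y] * math.log2(W[x][y] / q[y])
               for x in range(2) for y in range(2) if W[x][y] > 0.0)

# Hypothetical sample points: one per class, plus a pure noise channel.
examples = [(0.9, 0.7), (0.2, 0.3), (0.25, 0.75)]
summary = [(a, b, bmc_class(a, b), uniform_mutual_information(a, b))
           for a, b in examples]
```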

Proposition 1: For any $i \in \{-, +\}$ and any $W_0 \in B^i$, we
have

    I_{MIS}(U, W_0, \log W_1) = \begin{cases} I(U, W_0) & \text{if } W_1 \in B^i, \\ 0 & \text{otherwise.} \end{cases}

This proposition tells us that, as long as the channel used
for the decoding ($W_1$) is in the same class ($B^-$ or $B^+$) as
the true channel ($W_0$), the mismatched mutual information
is equal to the mutual information itself, evaluated with the uniform
input distribution. If instead the true channel and the decoding
metric come from different classes, then the mismatched
mutual information is zero.

Proof: As defined in Lemma 1, we have

    I_{MIS}(U, W_0, \log W_1) = \inf_{\mu \in A} D(\mu \| \mu_0^p),    (6)

where $A = \{\mu : \mu^p = \mu_0^p, E_{\mu} \log W_1 \geq E_{\mu_0} \log W_1\}$. Note
that the channels W which induce a $\mu$ such that $\mu^p = \mu_0^p$ are
parameterized in $[0, 1]^2$ by the line passing through $\mu_0$ with a
slope of 1. Hence, since $\mu_0 \in \partial A$ (the boundary of A), it is
easy to verify that the region A is the segment starting at $\mu_0$
and going either up or down (with slope 1). This leads to two
possibilities: either $\mu_0^p \in A$ and (6) is 0, or $\mu_0^p \notin A$ and the
minimizer of (6) is $\mu_0$, implying the claims of Proposition 1.

■
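
The argument of the proof can be reproduced numerically. With uniform input and a fixed output marginal, the joint distributions form a one-parameter family, so the infimum in (6) can be approximated by brute force. In the sketch below, the parameter t = μ(0, 0), the grid size, the tolerance, and the three channels are assumptions made for illustration, and $W_1$ is assumed to have strictly positive entries so that the metric is finite.

```python
import math

def joint_from_t(t, q0):
    """2x2 joint with uniform input marginal, output marginal (q0, 1 - q0),
    and free parameter t = mu(0, 0)."""
    return [[t, 0.5 - t], [q0 - t, 0.5 - q0 + t]]

def divergence_from_product(mu, q0):
    """D(mu || U x q) in bits, skipping non-positive entries."""
    q = (q0, 1.0 - q0)
    return sum(mu[x][y] * math.log2(mu[x][y] / (0.5 * q[y]))
               for x in range(2) for y in range(2) if mu[x][y] > 0.0)

def expected_metric(mu, W1):
    """E_mu[log W1] for a channel W1 with strictly positive entries."""
    return sum(mu[x][y] * math.log(W1[x][y]) for x in range(2) for y in range(2))

def mismatched_rate(W0, W1, grid=20001):
    """Brute-force approximation of the infimum in (6) for uniform input."""
    q0 = 0.5 * (W0[0][0] + W0[1][0])   # output marginal of mu_0
    t0 = 0.5 * W0[0][0]                # parameter of mu_0 itself
    threshold = expected_metric(joint_from_t(t0, q0), W1)
    lo, hi = max(0.0, q0 - 0.5), min(0.5, q0)
    candidates = [lo + (hi - lo) * i / (grid - 1) for i in range(grid)] + [t0]
    feasible = [t for t in candidates
                if expected_metric(joint_from_t(t, q0), W1) >= threshold - 1e-12]
    return min(divergence_from_product(joint_from_t(t, q0), q0) for t in feasible)

W0 = [[0.9, 0.1], [0.3, 0.7]]            # (a, b) = (0.9, 0.7), in B+
W1_same = [[0.8, 0.2], [0.4, 0.6]]       # (0.8, 0.6), in B+
W1_other = [[0.2, 0.8], [0.7, 0.3]]      # (0.2, 0.3), in B-

q0 = 0.5 * (W0[0][0] + W0[1][0])
uniform_mi = divergence_from_product([[0.5 * W0[x][y] for y in range(2)]
                                      for x in range(2)], q0)
rate_same = mismatched_rate(W0, W1_same)    # expected to match I(U, W0)
rate_other = mismatched_rate(W0, W1_other)  # expected to be close to 0
```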

Proposition 2: For any binary input/output channel $W_0$
and for any binary symmetric channel $W_1$, i.e., $W_1(0|0) = W_1(1|1)$, we have

    I_{MIS}(U, W_0, \{\log W_1, \log \bar{W}_1\}) = I(U, W_0),

where $\bar{W}_1$ is the BSC defined by $\bar{W}_1(0|0) = 1 - W_1(0|0)$.
Proof: As defined in Lemma 1, we have

    I_{MIS}(U, W_0, \{\log W_1, \log \bar{W}_1\}) = \bigwedge_{k=1}^{2} \inf_{\mu \in A_k} D(\mu \| \mu_0^p),    (7)

where $A_k = \{\mu : \mu^p = \mu_0^p, E_{\mu} \log W_k \geq \bigvee_{j=1}^{2} E_{\mu_0} \log W_j\}$,
k = 1, 2, and $W_2 = \bar{W}_1$. Note that, although in general
taking the likelihood metrics as opposed to the a posteriori
metrics makes an important difference when defining a generalized
linear decoder (cf. (2)), here it does not, since we are
working with BSC channels for the metrics. Assume w.l.o.g.
that $W_0, W_1 \in B^-$. Then, a straightforward computation
shows that

    \bigvee_{j=1}^{2} E_{\mu_0} \log W_j = E_{\mu_0} \log W_1.

Hence

    \inf_{\mu \in A_1} D(\mu \| \mu_0^p) = I_{MIS}(U, W_0, \log W_1),

and from Proposition 1,

    I_{MIS}(U, W_0, \log W_1) = I(U, W_0).
Moreover, note that for any channel $W_0$, if we define $\bar{W}_0$
to be the reverse channel (cf. Figure 1), and $\mu_0$, $\bar{\mu}_0$ to be the
corresponding measures, we have $\bar{\mu}_0^p = \mu_0^p$,

    E_{\mu_0} \log W_1 = E_{\bar{\mu}_0} \log \bar{W}_1,

and

    A_2 = \{\mu : \mu^p = \mu_0^p, E_{\mu} \log W_2 \geq E_{\mu_0} \log W_1\}
        = \{\mu : \mu^p = \bar{\mu}_0^p, E_{\mu} \log \bar{W}_1 \geq E_{\bar{\mu}_0} \log \bar{W}_1\}.

Therefore,

    \inf_{\mu \in A_2} D(\mu \| \mu_0^p) = I_{MIS}(U, \bar{W}_0, \log \bar{W}_1) = I_{MIS}(U, W_0, \log W_1),

and both terms in the RHS of (7) are equal to $I(U, W_0)$.


[Figure: unit square of BMC parameters (a, b), with corners (0,0), (1,0), and (0,1).]

Fig. 1. Reverse channels in the BMC setting

An extended discussion and alternate proofs of the previous
results can be found in [20].

Corollary 1: We have $\beta_2 = 1$, which is achieved by picking
any two metrics of the form $d_1 = \log W_1$, $d_2 = \log \bar{W}_1$, as
long as $W_1$ is a BSC (and $\bar{W}_1$ its reverse BSC).
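
To see why the choice of the BSC does not matter in Corollary 1, note that for a BSC with crossover ε the additive score depends on a codeword only through its Hamming distance d to y, and the reverse BSC gives the same function evaluated at n - d. The maximum of the two scores is therefore a function of min(d, n - d). The sketch below compares the two-metric decoder with this distance rule on a toy codebook; the codebook, the received word, and the function names are hypothetical, and the equivalence is a short calculation made here rather than a statement taken from the references.

```python
import math

def bsc_metric(eps):
    """Metric d(u, v) = log W(v | u) for a BSC with crossover eps in (0, 1)."""
    return {(u, v): math.log(1 - eps if u == v else eps)
            for u in (0, 1) for v in (0, 1)}

def two_metric_decode(y, codebook, eps):
    """Generalized linear decoder induced by log W1 and log W1_bar, where
    W1 = BSC(eps) and W1_bar = BSC(1 - eps) is its reverse."""
    metrics = [bsc_metric(eps), bsc_metric(1 - eps)]
    def score(x):
        return max(sum(d[(xi, yi)] for xi, yi in zip(x, y)) for d in metrics)
    return max(range(len(codebook)), key=lambda m: score(codebook[m]))

def distance_rule_decode(y, codebook):
    """Pick the codeword whose Hamming distance to y is farthest from n / 2."""
    n = len(y)
    def folded_distance(x):
        dist = sum(xi != yi for xi, yi in zip(x, y))
        return min(dist, n - dist)
    return min(range(len(codebook)), key=lambda m: folded_distance(codebook[m]))

# Hypothetical codebook; y is the complement of codeword 1 with one position changed,
# as could happen on a flipping-like channel.
codebook = [(0, 0, 0, 0, 1, 1, 1, 1),
            (0, 1, 0, 1, 0, 1, 0, 1),
            (1, 1, 0, 0, 1, 1, 0, 0)]
y = (1, 0, 1, 0, 1, 0, 1, 1)

decisions = [two_metric_decode(y, codebook, eps) for eps in (0.1, 0.3)]
reference = distance_rule_decode(y, codebook)
```

In this example both crossover values lead to the same decision as the distance rule, which is consistent with the metrics being chosen without knowledge of the channel.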

Corollary 2: For any compound set S, 94.2% of the compound
capacity can be achieved by using a generalized linear
decoder induced by two metrics. Moreover, if the optimal input
distribution (achieving compound capacity) is uniform, we can
achieve 100% of the compound capacity with two metrics.

Note: if the optimal input distribution is non-uniform, it may 
still be possible to achieve 100% of the compound capacity 
with two metrics, but the above results will have to be adapted 
to the non-uniform input distribution case. 

V. Discussion 

In this paper, we have shown that, for binary-input binary-output
memoryless channels, compound capacity achieving
decoders can have a much simpler structure than the maximum
mutual information (MMI) decoder. These decoders, namely
the generalized linear decoders, preserve many features of
the MMI decoder. When the input distribution is chosen
to be uniform, a generalized linear decoder using channel-independent
metrics (i.e., metrics selected without
knowledge of the channel law) is shown to achieve the same
rate as that of an optimum decoder (based on the exact channel
law). On the other hand, for any arbitrary compound BMC, at
least 94.2% of the compound capacity can be realized by such
a decoder. Finally, a natural extension of this work would be to
investigate the case of non-binary alphabets. Even for binary
inputs and ternary outputs, it does not seem straightforward
to establish whether $\beta_K$ can be made exactly 1 for K large
enough (although one can show that it must tend to 1 using
results from [2]).